MATTEL DATA TEST

Supporting Materials

Missing data and outliers

The variables ‘PoolQC’, ‘MiscFeature’ and ‘Alley’ are each missing more than 93% of their values, while ‘Fence’ is missing about 80%. ‘FireplaceQu’ is missing a little less than 50% of its values, and ‘LotFrontage’ is missing about 17%. All other variables are either missing less than 5% of their values or contain no missing values.
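The percentages above can be computed per column with pandas. The following is a minimal sketch on a toy DataFrame; the rows are invented for illustration and are not taken from the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; all values are invented for illustration.
df = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

# isna() marks missing cells; the column mean of that boolean mask is the
# fraction missing, which is then converted to a percentage.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually have missing values.
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```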

Based on the data description, PoolQC is NaN when there is no pool; hence, I changed its missing values to ‘No_Pool’. A similar approach was applied to ‘Fence’, ‘FireplaceQu’, and the garage- and basement-related features. Next, the variable mean was used to fill missing values for the ‘LotFrontage’ and ‘MasVnrArea’ variables.
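A minimal sketch of the two filling strategies is shown below. Only the ‘No_Pool’ label comes from the text; the ‘No_Fence’ label and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; all values are invented for illustration.
df = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan],
    "Fence": ["MnPrv", np.nan, np.nan],
    "LotFrontage": [60.0, np.nan, 80.0],
    "MasVnrArea": [0.0, 100.0, np.nan],
})

# Categorical columns where NaN means "feature absent" get an explicit label.
# 'No_Fence' is a hypothetical label; only 'No_Pool' is named in the text.
absent_labels = {"PoolQC": "No_Pool", "Fence": "No_Fence"}
for col, label in absent_labels.items():
    df[col] = df[col].fillna(label)

# Numerical columns are filled with the mean of their observed values.
for col in ["LotFrontage", "MasVnrArea"]:
    df[col] = df[col].fillna(df[col].mean())

print(df)
```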

There are 4 outliers whose GrLivArea is above 4000 square feet.
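One way to flag and drop such rows is a boolean mask on GrLivArea. This is a sketch only; the toy rows are invented, so the outlier count here differs from the 4 found in the real data.

```python
import pandas as pd

# Toy stand-in for the housing data; all values are invented for illustration.
df = pd.DataFrame({
    "GrLivArea": [1710, 4676, 1262, 5642, 2198],
    "SalePrice": [208500, 184750, 181500, 160000, 250000],
})

threshold = 4000  # square feet, as stated in the text

# Boolean mask that is True for rows above the threshold.
is_outlier = df["GrLivArea"] > threshold
print(f"{int(is_outlier.sum())} outlier rows")

# Keep the remaining rows and renumber the index.
df_clean = df.loc[~is_outlier].reset_index(drop=True)
```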

After filling the missing values and deleting the outliers, I split the dataset into training and testing sets using a 0.75 partition.
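The split could look like the sketch below. It assumes that 0.75 is the fraction of rows assigned to the training set and uses pandas sampling with a fixed seed; the tool actually used for the split is not stated in the text.

```python
import pandas as pd

# Toy stand-in for the cleaned data; all values are invented for illustration.
df = pd.DataFrame({
    "GrLivArea": [1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400],
    "SalePrice": [100, 120, 140, 160, 180, 200, 220, 240],
})

# Assumption: 75% of the rows go to training, the remaining 25% to testing.
train = df.sample(frac=0.75, random_state=0)

# The test set is every row that was not drawn into the training set.
test = df.drop(train.index)

print(len(train), len(test))
```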


Data Exploration

For each categorical variable, I checked the distribution of SalePrice with respect to the variable's values, using the training set only.
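As a sketch of this check, SalePrice can be summarized per category with a group-by. The neighborhood labels and prices below are invented for illustration, and the choice of summary statistics is an assumption.

```python
import pandas as pd

# Toy training set; labels and prices are invented for illustration.
train = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "CollgCr", "CollgCr", "OldTown", "OldTown"],
    "SalePrice": [140000, 150000, 200000, 220000, 110000, 130000],
})

# Summarize SalePrice within each category, ordered by median price.
summary = (
    train.groupby("Neighborhood")["SalePrice"]
    .agg(["count", "median", "mean"])
    .sort_values("median", ascending=False)
)
print(summary)
```

A box plot per category would show the same information graphically.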

For each numerical variable, I checked the correlation between SalePrice and the variable's values, using the training set only.
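A minimal sketch of this check follows. It assumes Pearson correlation (the pandas default), since the text does not say which coefficient was used; the toy values are invented.

```python
import pandas as pd

# Toy training set; all values are invented for illustration.
train = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "OverallQual": [4, 5, 7, 8, 9],
    "LotArea": [9000, 8000, 9500, 7000, 8500],
    "SalePrice": [100000, 150000, 200000, 250000, 300000],
})

# Correlation of every numerical column with SalePrice (Pearson by default).
corr = train.select_dtypes(include="number").corr()["SalePrice"].drop("SalePrice")

# Order the variables by the absolute strength of the correlation.
corr = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr)
```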

I then combined the numerical and categorical variables to evaluate feature importance, again using the training set only.
The Recursive Feature Elimination (RFE) method also identified 16 candidate variables: “GrLivArea”, “Neighborhood”, “OverallQual”, “BsmtFinSF1”, “TotalBsmtSF”, “1stFlrSF”, “GarageArea”, “2ndFlrSF”, “GarageCars”, “LotArea”, “ExterQual”, “GarageType”, “FireplaceQu”, “BsmtFinType1”, “KitchenQual”, and “OverallCond”.
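To illustrate how RFE works, the sketch below runs scikit-learn's RFE on synthetic data. The linear-regression estimator, the feature names, and the data are all assumptions made for illustration; the estimator and encoding used in the actual analysis are not stated here.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: the target depends on the first two features only.
# Feature names are hypothetical and not taken from the housing dataset.
rng = np.random.default_rng(0)
n = 200
feature_names = ["signal_a", "signal_b", "noise_a", "noise_b"]
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# RFE repeatedly fits the estimator and drops the weakest feature
# (smallest absolute coefficient) until the requested number remains.
selector = RFE(LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X, y)

# support_ is a boolean mask over the input features; ranking_ is 1 for kept ones.
selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
print(selected)
```

With real data, categorical variables would first need a numerical encoding, and features on different scales would need standardizing for the coefficients to be comparable.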

Feature Engineering and Model Selection

Prediction